
What is This Research All About?

  • Improving the truncated regression model (abbreviated as TRM) by fixing boundary violations

This research has a single goal: to fix the boundary violations that arise when the truncated regression model is applied. For instance, a party's vote share is naturally confined to the range from 0% to 100%. If the model yields a predicted value below 0% or above 100%, that prediction is a boundary violation, which should never occur. Applying constrained optimization techniques from nonlinear programming, I develop a revised TRM that solves the boundary violation problem.
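The revised TRM itself is developed in the paper. As a rough illustration of the general idea only, not the paper's actual estimator, the following sketch uses constrained optimization (SciPy's SLSQP solver, on hypothetical simulated vote-share data) to force every fitted value into [0, 100]:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])   # intercept + one regressor
y = np.clip(5 + 9 * X[:, 1] + rng.normal(0, 8, 50), 0, 100)  # vote shares in [0, 100]

def sse(beta):
    return np.sum((y - X @ beta) ** 2)  # least-squares objective

# Inequality constraints: every fitted value X @ beta must lie in [0, 100]
cons = [{"type": "ineq", "fun": lambda b: X @ b},         # X @ b >= 0
        {"type": "ineq", "fun": lambda b: 100 - X @ b}]   # X @ b <= 100

res = minimize(sse, x0=np.zeros(2), method="SLSQP", constraints=cons)
fitted = X @ res.x
print(fitted.min(), fitted.max())  # both within [0, 100] by construction
```

The same device — imposing the admissible range as explicit constraints on the optimizer rather than hoping the unconstrained solution respects it — is what distinguishes a constrained formulation from ordinary maximum likelihood.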

  • Don't confuse the truncated regression with the censored regression

Truncated regression differs from censored regression in two respects. First, according to Takeshi Amemiya (1984: 3):

"Tobit models refer to regression models in which the range of the dependent variable is constrained in some way. In economics, such a model was first suggested in a pioneering work by Tobin (1958). He analyzed household expenditure on durable goods using a regression model which specifically took account of the fact that the expenditure (the dependent variable of his regression model) cannot be negative. Tobin called his model the model of limited dependent variables. It and its various generalizations are known popularly among economists as Tobit models, a phrase coined by Goldberger (1964) because of similarities to probit models. These models are also known as censored or truncated regression models. The model is called truncated if the observations outside a specified range are totally lost and censored if one can at least observe the exogenous variables."

In other words, when the truncated regression model is applied, we discard both the DV and the IV information for observations with inadmissible DV values. When the censored regression model is applied, we keep these observations but record the inadmissible DV values at the nearest censoring point. The two models therefore assume different data-generating mechanisms.
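The two data-generating mechanisms can be made concrete with a small simulation (all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 1000)
y_star = 20 + 6 * x + rng.normal(0, 25, 1000)  # latent outcome, may leave [0, 100]

# Truncation: observations outside [0, 100] are dropped entirely (y AND x are lost)
keep = (y_star >= 0) & (y_star <= 100)
x_trunc, y_trunc = x[keep], y_star[keep]

# Censoring: every observation is kept; out-of-range y values are recorded
# at the nearest censoring point
y_cens = np.clip(y_star, 0, 100)

print(len(y_trunc), len(y_cens))  # truncated sample is strictly smaller
```

The truncated sample has lost rows; the censored sample keeps all rows but piles probability mass at the limits — exactly the distinction Amemiya draws above.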

Second, because the two models have different data-generating mechanisms, they have different likelihood functions. Let the dependent variable have a lower limit $a$ and an upper limit $b$. The likelihood function for truncated regression is
\begin{equation*}
L_{TRM}=\prod\limits_{i=1}^{n}{\left\{\frac{\exp\left(\frac{-{{\left( {{y}_{i}}-\boldsymbol{{{x}_{i}}\beta}  \right)}^{2}}}{2{{\sigma }^{2}}} \right)}{\int_{a}^{b}{\exp \left( \frac{-{{\left( y-\boldsymbol{{{x}_{i}}\beta}  \right)}^{2}}}{2{{\sigma }^{2}}} \right)dy}} \right\}}.
\end{equation*}
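Since the integral in the denominator equals $\sigma\sqrt{2\pi}\left[\Phi\!\left(\frac{b-\boldsymbol{x_{i}\beta}}{\sigma}\right)-\Phi\!\left(\frac{a-\boldsymbol{x_{i}\beta}}{\sigma}\right)\right]$, the negative log-likelihood can be evaluated with standard normal CDF calls. A sketch (the function name and signature are my own, not from the paper):

```python
import numpy as np
from scipy.stats import norm

def trm_negloglik(params, y, X, a=0.0, b=100.0):
    """Negative log-likelihood of the truncated regression model on [a, b]."""
    beta, sigma = params[:-1], params[-1]
    mu = X @ beta
    # log of the normal density (1/sigma) * phi((y_i - mu)/sigma) ...
    log_num = norm.logpdf(y, loc=mu, scale=sigma)
    # ... minus the log of the normalizing probability Phi(b|.) - Phi(a|.)
    log_den = np.log(norm.cdf(b, loc=mu, scale=sigma)
                     - norm.cdf(a, loc=mu, scale=sigma))
    return -np.sum(log_num - log_den)
```

Minimizing this function over $(\boldsymbol{\beta}, \sigma)$ gives the usual (unconstrained) TRM maximum-likelihood estimates.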

On the other hand, the likelihood function for censored regression is
\begin{equation*}
L_{CRM}=\prod\limits_{i=1}^{n}\Phi {{\left( \frac{a-\boldsymbol{{{x}_{i}}\beta} }{\sigma}\right)}^{{{d}_{l}}}} {{\left(1-\Phi \left( \frac{b-\boldsymbol{{{x}_{i}}\beta} }{\sigma } \right) \right)}^{{{d}_{u}}}} {{\left( \frac{1}{\sigma }\phi \left( \frac{y_{i}-\boldsymbol{{{x}_{i}}\beta} }{\sigma } \right) \right)}^{1-{{d}_{l}}-{{d}_{u}}}},
\end{equation*}
where the indicator variables are defined as
\begin{equation*}
{d}_{l}=
\begin{cases}
1 &\text{if $y_{i}\le a$}\\
0 &\text{if $y_{i}>a$,}
\end{cases}
\qquad {d}_{u}=
\begin{cases}
1 &\text{if $y_{i}\ge b$}\\
0 &\text{if $y_{i}<b$.}
\end{cases}
\end{equation*}
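A sketch of this censored (two-limit Tobit) negative log-likelihood, with the indicator variables computed from the observed $y_i$ (again, the function name and signature are my own):

```python
import numpy as np
from scipy.stats import norm

def crm_negloglik(params, y, X, a=0.0, b=100.0):
    """Negative log-likelihood of the two-limit censored (Tobit) model."""
    beta, sigma = params[:-1], params[-1]
    mu = X @ beta
    d_l = y <= a                  # censored at the lower limit
    d_u = y >= b                  # censored at the upper limit
    mid = ~(d_l | d_u)            # uncensored observations
    ll = np.zeros_like(y, dtype=float)
    ll[d_l] = norm.logcdf((a - mu[d_l]) / sigma)          # Phi((a - x*b)/s)
    ll[d_u] = norm.logsf((b - mu[d_u]) / sigma)           # 1 - Phi((b - x*b)/s)
    ll[mid] = norm.logpdf(y[mid], loc=mu[mid], scale=sigma)
    return -np.sum(ll)
```

The three branches map one-to-one onto the three factors of $L_{CRM}$ above.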

Clearly, then, truncated regression and censored regression are two different models.

  • Don't confuse the truncated regression with the regression with a latent variable

Some might argue that we can simply assume a latent, untruncated normal distribution and add a link function to transform all predicted DV values into admissible values, so that the truncated regression model becomes unnecessary. There is nothing wrong with this argument, but it assumes yet another data-generating mechanism and therefore adopts a different likelihood function. What it proposes is thus beyond the scope of my research, because my work intends to improve the current truncated regression model without assuming an extra data-generating mechanism such as censoring or selection bias. We should not compare apples and oranges.^i
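To make the distinction concrete, here is a minimal sketch of the alternative described above: a latent, unbounded linear predictor pushed through a logistic link so that predictions land in (0, 100). This is a different data-generating assumption, not the TRM, and the example data are hypothetical:

```python
import numpy as np

def link_predict(X, beta):
    """Map a latent linear predictor onto (0, 100) via a logistic link.

    This illustrates the alternative model the text sets aside: it changes
    the assumed data-generating mechanism instead of truncating a normal."""
    eta = X @ beta                        # latent, unbounded prediction
    return 100.0 / (1.0 + np.exp(-eta))  # always strictly inside (0, 100)

X = np.column_stack([np.ones(5), np.linspace(-3, 3, 5)])
preds = link_predict(X, np.array([0.5, 1.2]))
print(preds)  # every value lies strictly between 0 and 100
```

Because the link never touches the boundaries, such a model cannot produce boundary violations — but it also implies a different likelihood, which is precisely why it is not comparable to the TRM.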

  • Don't confuse the truncated regression with the regression with a selection mechanism

The Heckman model (Heckman, 1979) has become popular in the political science literature in recent years. However, it is designed to handle selection bias when $y$ (DV) and $x$ (IV) are observed based on the value of $z$ (the selection variable), where $z$ is subject to a censoring mechanism. Specifically, the Heckman model assumes that the error terms of the selection and outcome regressions follow a bivariate normal distribution with a correlation coefficient (Sigelman and Zeng, 1999, 177). The empirical truncation of the outcome dependent variable depends on the selection mechanism specified in the Heckman model. Neither the censored model nor the sample-selection model directly assumes that the dependent variable follows a univariate truncated normal distribution, and both add an assumption to the data-generating process that might not be true. Therefore, truncated regression and the Heckman model are completely different.
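For concreteness, a minimal sketch of Heckman's classic two-step estimator on hypothetical simulated data — a probit selection equation, then OLS augmented with the inverse Mills ratio. This is the textbook procedure, not anything from the present paper:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 2000
z = rng.normal(size=n)                    # selection covariate
x = rng.normal(size=n)                    # outcome covariate
e_s, e_o = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], n).T
selected = (0.5 + 1.0 * z + e_s) > 0      # selection rule
y = np.where(selected, 1.0 + 2.0 * x + e_o, np.nan)  # y observed only if selected

# Step 1: probit of the selection indicator on z, by maximum likelihood
Z = np.column_stack([np.ones(n), z])
def probit_nll(g):
    p = norm.cdf(Z @ g)
    return -np.sum(np.where(selected, np.log(p), np.log(1 - p)))
g_hat = minimize(probit_nll, np.zeros(2), method="BFGS").x

# Step 2: OLS of y on x plus the inverse Mills ratio, on the selected sample
lam = norm.pdf(Z @ g_hat) / norm.cdf(Z @ g_hat)  # inverse Mills ratio
W = np.column_stack([np.ones(n), x, lam])[selected]
coef, *_ = np.linalg.lstsq(W, y[selected], rcond=None)
print(coef[:2])  # estimated outcome intercept and slope
```

The correction works through the auxiliary selection equation and the bivariate-normality assumption — machinery that the truncated regression model simply does not have, which is the point of the paragraph above.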

____________________

Footnote

i In contemporary statistical science, likelihood theory is a crucial paradigm of inference for data analysis. (Royall, 1997, xiii) It provides a unifying approach to statistical modeling for both frequentists and Bayesians through the criterion of maximum likelihood. (Azzalini, 1996) The rapid development of political methodology over the last two decades has also witnessed the establishment of the likelihood paradigm in the scientific study of politics. (King, 1998) As a model of inference, the fundamental assumption of likelihood theory is the likelihood principle, stating that "all evidence, which is obtained from an experiment, about an unknown quantity $\theta$, is contained in the likelihood function of $\theta$ for the given function." (Berger and Wolpert, 1984, vii) In other words, given that the likelihood function is defined by the probability density (or mass) function, we must make a distributional assumption about the dependent variable to derive a likelihood function. The plausibility of such a distributional assumption is therefore vital to the validity of statistical inference. And the assumption about the data-generating mechanism largely determines which distributional assumption should be applied. (This footnote originally appears in another article of mine, "Solving Problems in the Panel Regression Model for Truncated Dependent Variables: A Constrained Optimization Method.")

Download [full paper] [supplementary materials] [all replication files] [technical note]